{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Custom TF-Hub Word Embedding with text2hub\n",
    "\n",
    "**Learning Objectives:**\n",
    "  1. Learn how to deploy AI Hub Kubeflow pipeline.\n",
    "  1. Learn how to configure the run parameters for text2hub.\n",
    "  1. Learn how to inspect text2hub generated artifacts and word embeddings in TensorBoard.\n",
    "  1. Learn how to run TF 1.x generated hub module in TF 2.0.\n",
    "\n",
    "\n",
    "## Introduction\n",
    "\n",
    "\n",
    "Pre-trained text embeddings such as TF-Hub modules are a great tool for building machine learning models for text features, since they capture relationships between words. These embeddings are generally trained on vast but generic text corpora like Wikipedia or Google News, which means that they are usually very good at representing generic text, but not so much when the text comes from a very specialized domain with unique vocabulary, such as in the medical field.\n",
    "\n",
    "\n",
    "One problem in particular that arises when applying a TF-Hub text module which was pre-trained on a generic corpus to specialized text is that all of the unique, domain-specific words will be mapped to the same “out-of-vocabulary” (OOV) vector. By doing so we lose a very valuable part of the text information, because for specialized texts the most informative words are often the words that are very specific to that special domain. Another issue is that of commonly misspelled words from text gathered from say, customer feedback. Applying a generic pre-trained embedding will send the misspelled word to the OOV vectors, losing precious information. However, by creating a TF-Hub module tailored to the texts coming from that customer feedback means that common misspellings present in your real customer data will be part of the embedding vocabulary and should be close by closeby to the original word in the embedding space.\n",
    "\n",
    "\n",
    "In this notebook, we will learn how to generate a text TF-hub module specific to a particular domain using the text2hub Kubeflow pipeline available on Google AI Hub. This pipeline takes as input a corpus of text stored in a GCS bucket and outputs a TF-Hub module to a GCS bucket. The generated TF-Hub module can then be reused both in TF 1.x or in TF 2.0 code by referencing the output GCS bucket path when loading the module. \n",
    "\n",
    "Our first order of business will be to learn how to deploy a Kubeflow pipeline, namely text2hub, stored in AI Hub to a Kubeflow cluster. Then we will dig into the pipeline run parameter configuration and review the artifacts produced by the pipeline during its run. These artifacts are meant to help you assess how good the domain specific TF-hub module you generated is. In particular, we will  explore the embedding space visually using TensorBoard projector, which provides a tool to list the nearest neighbors to a given word in the embedding space.\n",
    "\n",
    "\n",
    "At last, we will explain how to run the generated module both in TF 1.x and TF 2.0. Running the module in TF 2.0 will necessite a small trick that’s useful to know in itself because it allows you to use all the TF 1.x modules in TF hub in TF 2.0 as a Keras layer. \n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip freeze | grep tensorflow-hub==0.7.0 || pip install tensorflow-hub==0.7.0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "import tensorflow as tf\n",
    "import tensorflow_hub as hub"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Replace by your GCP project and bucket:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "PROJECT = \"qwiklabs-gcp-03-3247cf88ddb1\" # \"your-gcp-project-here\" # REPLACE WITH YOUR PROJECT NAME\n",
    "BUCKET = \"qwiklabs-gcp-03-3247cf88ddb1\" #\"your-gcp-bucket-here\" # REPLACE WITH YOUR BUCKET NAME\n",
    "\n",
    "os.environ[\"PROJECT\"] = PROJECT\n",
    "os.environ[\"BUCKET\"] = BUCKET"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading the dataset in GCS"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The corpus we chose is one of [Project Gutenberg medical texts](http://www.gutenberg.org/ebooks/bookshelf/48): [A Manual of the Operations of Surgery](http://www.gutenberg.org/ebooks/24564) by Joseph Bell, containing very specialized language. \n",
    "\n",
    "The first thing to do is to upload the text into a GCS bucket:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "URL=https://www.gutenberg.org/cache/epub/24564/pg24564.txt\n",
    "OUTDIR=gs://$BUCKET/custom_embedding\n",
    "CORPUS=surgery_manual.txt\n",
    "\n",
    "curl $URL > $CORPUS\n",
    "gcloud storage cp $CORPUS $OUTDIR/$CORPUS"   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It has very specialized language such as \n",
    "\n",
    "```\n",
    "On the surface of the abdomen the position of this vessel would be \n",
    "indicated by a line drawn from about an inch on either side of the \n",
    "umbilicus to the middle of the space between the symphysis pubis \n",
    "and the crest of the ilium.\n",
    "```\n",
    "\n",
    "Now let's go over the steps involved in creating your own embedding from that corpus."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: Download the `text2hub` pipeline from AI Hub (TODO 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Go on [AI Hub](https://aihub.cloud.google.com/u/0/) and search for the `text2hub` pipeline, or just follow [this link](https://aihub.cloud.google.com/u/0/p/products%2F4a91d2d0-1fb8-4e79-adf7-a35707071195).\n",
    "You'll land onto a page describing `text2hub`. Click on the \"Download\" button on that page to download the Kubeflow pipeline and click `Accept`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![img](./assets/text2hub_download.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The text2hub pipeline is a KubeFlow pipeline that comprises three components; namely:\n",
    "\n",
    "\n",
    "* The **text2cooc** component that computes a word co-occurrence matrix\n",
    "from a corpus of text\n",
    "\n",
    "* The **cooc2emb** component that factorizes the\n",
    "co-occurrence matrix using [Swivel](https://arxiv.org/pdf/1602.02215.pdf) into\n",
    "the word embeddings exported as a tsv file\n",
    "\n",
    "* The **emb2hub** component that takes the word\n",
    "embedding file and generates a TF Hub module from it\n",
    "\n",
    "\n",
    "Each component is implemented as a Docker container image that's stored into Google Cloud Docker registry, [gcr.io](https://cloud.google.com/container-registry/). The `pipeline.tar.gz` file that you downloaded is a yaml description of how these containers need to be composed as well as where to find the corresponding images. \n",
    "\n",
    "**Remark:** Each component can be run individually as a single component pipeline in exactly the same manner as the `text2hub` pipeline. On AI Hub, each component has a pipeline page describing it and from where you can download the associated single component pipeline:\n",
    "\n",
    " * [text2cooc](https://aihub.cloud.google.com/u/0/p/products%2F6d998d56-741e-4154-8400-0b3103f2a9bc)\n",
    " * [cooc2emb](https://aihub.cloud.google.com/u/0/p/products%2Fda367ed9-3d70-4ca6-ad14-fd6bf4a913d9)\n",
    " * [emb2hub](https://aihub.cloud.google.com/u/0/p/products%2F1ef7e52c-5da5-437b-a061-31111ab55312)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Upload the pipeline to the Kubeflow cluster (TODO 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Go to your [Kubeflow cluster dashboard](https://console.cloud.google.com/ai-platform/pipelines/clusters) or navigate to `Navigation menu > AI Platform > Pipelines` and click `Open Pipelines Dashboard` then click on the `Pipelines` tab to create a new pipeline. You'll be prompted to upload the pipeline file you have just downloaded, click `Upload Pipeline`. Rename the generated pipeline name to be `text2hub` to keep things nice and clean. Click `Create`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![img](./assets/text2hub_upload1.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3: Create a pipeline run (TODO 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After uploading the pipeline, you should see `text2hub` appear on the pipeline list. Click on it. This will bring you to a page describing the pipeline (explore!) and allowing you to create a run. You can inspect the input and output parameters of each of the pipeline components by clicking on the component node in the graph representing the pipeline. Click `Create Run`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![img](./assets/text2hub_run_creation.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4: Enter the run parameters (TODO 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`text2hub` has the following run parameters you can configure:\n",
    "\n",
    "Argument                                         | Description                                                                           | Optional | Data Type | Accepted values | Default\n",
    "------------------------------------------------ | ------------------------------------------------------------------------------------- | -------- | --------- | --------------- | -------\n",
    "gcs-path-to-the-text-corpus                      | A Cloud Storage location pattern (i.e., glob) where the text corpus will be read from | False    | String    | gs://...        | -\n",
    "gcs-directory-path-for-pipeline-output           | A Cloud Storage directory path where the pipeline output will be exported             | False    | String    | gs://...        | -\n",
    "number-of-epochs                                 | Number of epochs to train the embedding algorithm (Swivel) on                         | True     | Integer   | -               | 40\n",
    "embedding-dimension                              | Number of components of the generated embedding vectors                               | True     | Integer   | -               | 128\n",
    "co-occurrence-word-window-size                   | Size of the sliding word window where co-occurrences are extracted from               | True     | Integer   | -               | 10\n",
    "number-of-out-of-vocabulary-buckets              | Number of out-of-vocabulary buckets                                                   | True     | Integer   | -               | 1\n",
    "minimum-occurrences-for-a-token-to-be-considered | Minimum number of occurrences for a token to be included in the vocabulary            | True     | Integer   | -               | 5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can leave most parameters with their default values except for\n",
    "`gcs-path-to-the-test-corpus` whose value should be set to"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!echo gs://$BUCKET/custom_embedding/surgery_manual.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "and for `gcs-directory-path-for-pipeline-output` which we will set to"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!echo gs://$BUCKET/custom_embedding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Remark**: `gcs-path-to-the-test-corpus` will accept a GCS pattern like `gs://BUCKET/data/*.txt` or simply a path like `gs://BUCKET/data/` to a GCS directory. All the files that match the pattern or that are in that directory will be parsed to create the word embedding TF-Hub module. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![img](./assets/text2hub_run_parameters1.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Make sure to choose experiment `default`. Once these values have been set, you can start the run by clicking on `Start`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 5: Inspect the run artifacts (TODO 3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the run has started you can see its state by going to the `Experiments` tab and clicking on the name of the run (here \"text2hub-1\"). "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![img](assets/text2hub_experiment_list1.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It will show you the pipeline graph. The components in green have successfuly completed. You can then click on them and look at the artifacts that these components have produced.\n",
    "\n",
    "The `text2cooc` components has \"co-occurrence extraction summary\" showing you the GCS path where the co-occurrence data has been saved. Their is a corresponding link that you can paste into your browser to inspect the co-occurrence data from the GCS browser. Some statistics about the vocabulary are given to you such as the most and least frequent tokens. You can also download the vocabulary file containing the token to be embedded. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![img](assets/text2cooc_markdown_artifacts-2.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `cooc2emb` has three artifacts\n",
    "* An \"Embedding Extraction Summary\" providing the information as where the model chekpoints and the embedding tables are exported to on GCP\n",
    "* A similarity matrix from a random sample of words giving you an indication whether the model associates close-by vectors to similar words\n",
    "* An button to start TensorBoard from the UI to inspect the model and visualize the word embeddings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![img](assets/cooc2emb_artifacts-2.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can have a look at the word embedding visualization provided by TensorBoard. Select the TF version: `TensorFlow 1.14.0`. Start TensorBoard by clicking on `Start Tensorboard` and then `Open Tensorboard` buttons, and then select \"Projector\".\n",
    "\n",
    "**Remark:** The projector tab may take some time to appear. If it takes too long it may be that your Kubeflow cluster is running an incompatible version of TensorBoard (your TB version should be between 1.13 and 1.15). If that's the case, just run Tensorboard from CloudShell or locally by issuing the following command."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!echo tensorboard --port 8080 --logdir gs://$BUCKET/custom_embedding/embeddings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The projector view will present you with a representation of the word vectors in a 3 dimensional space (the dim is reduced through PCA) that you can interact with. Enter in the search tool a few words like \"ilium\" and points in the 3D space will light up. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![img](assets/cooc2emb_tb_search.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you click on a word vector, you'll see appear the n nearest neighbors of that word in the embedding space. The nearset neighbors are both visualized in the center panel and presented as a flat list on the right. \n",
    "\n",
    "Explore the nearest neighbors of a given word and see if they kind of make sense. This will give you a rough understanding of the embedding quality. If it nearest neighbors do not make sense after trying for a few key words, you may need rerun `text2hub`, but this time with either more epochs or more data. Reducing the embedding dimension may help as well as modifying the co-occurence window size (choose a size that make sense given how your corpus is split into lines.)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![img](assets/cooc2emb_nn.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `emb2hub` artifacts give you a snippet of TensorFlow 1.x code showing you how to re-use the generated TF-Hub module in your code. We will demonstrate how to use the TF-Hub module in TF 2.0 in the next section."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![img](assets/emb2hub_artifacts-2.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Step 7: Using the generated TF-Hub module (TODO)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see now how to load the TF-Hub module generated by `text2hub` in TF 2.0.\n",
    "\n",
    "We first store the GCS bucket path where the TF-Hub module has been exported into a variable:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "MODULE = \"gs://{bucket}/custom_embedding/hub-module\".format(bucket=BUCKET)\n",
    "MODULE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we are ready to create a `KerasLayer` out of our custom text embedding."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "med_embed = hub.KerasLayer(MODULE)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That layer when called with a list of sentences will create a sentence vector for each sentence by averaging the word vectors of the sentence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "outputs = med_embed(tf.constant(['ilium', 'I have a fracture', 'aneurism']))\n",
    "outputs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you use a version of TensorFlow Hub smaller than `tensorflow-hub==0.7.0`, then you'll need to use the following wrapper to instanciate the `KerasLayer`:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python\n",
    "class Wrapper(tf.train.Checkpoint):\n",
    "    def __init__(self, spec):\n",
    "        super(Wrapper, self).__init__()\n",
    "        self.module = hub.load(spec)\n",
    "        self.variables = self.module.variables\n",
    "        self.trainable_variables = []\n",
    "    def __call__(self, x):\n",
    "        return self.module.signatures[\"default\"](x)[\"default\"]\n",
    "    \n",
    "med_embed = hub.KerasLayer(Wrapper(MODULE))\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright 2022 Google Inc. Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License"
   ]
  }
 ],
 "metadata": {
  "environment": {
   "name": "tf2-gpu.2-1.m68",
   "type": "gcloud",
   "uri": "gcr.io/deeplearning-platform-release/tf2-gpu.2-1:m68"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
